94 ◾ Bioinformatics
used for the de novo assembly. For the purpose of the practice, we will use paired-end short
reads produced by Illumina MiSeq for whole genome sequencing of Escherichia coli (str.
K-12). The files (forward and reverse FASTQ files) are available at the NCBI SRA database.
First, using the Linux terminal, create a directory for the exercise and change to it.
mkdir denovo; cd denovo
Inside that directory, create a subdirectory for the FASTQ files.
mkdir fastq; cd fastq
Then, you can download the raw data files into the “fastq” directory using the SRA Toolkit as
follows:
fasterq-dump --threads 4 --verbose ERR1007381
To save storage space, you can compress the two FASTQ files with the gzip utility as
follows:
gzip ERR1007381_1.fastq
gzip ERR1007381_2.fastq
Compression reduces the storage used by the two files from about 11 GB to about 3 GB.
We will use the “abyss-pe” command to perform the de novo genome assembly. Change to
the main exercise directory, one level above the “fastq” directory, by using “cd ..”. Then,
run the following command to construct contigs:
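Before assembling, it can be worth confirming that the two compressed files are intact and that the forward and reverse files hold the same number of reads. The following is a minimal sketch, assuming standard four-line FASTQ records; the function names count_reads and check_pair are hypothetical helpers introduced here for illustration and are not part of the SRA Toolkit or ABySS.

```shell
# Count reads in a gzip-compressed FASTQ file.
# Assumes the standard layout of 4 lines per read; a fractional
# result would indicate a truncated or malformed file.
count_reads() {
    gzip -cd "$1" | awk 'END { print NR / 4 }'
}

# Report the read counts of both mates and succeed only when they match.
check_pair() {
    r1=$(count_reads "$1")
    r2=$(count_reads "$2")
    echo "forward: $r1 reads, reverse: $r2 reads"
    [ "$r1" = "$r2" ]
}

# Example call from inside the "fastq" directory:
# check_pair ERR1007381_1.fastq.gz ERR1007381_2.fastq.gz
```

If the two counts differ, the download or the compression was probably interrupted, and the files should be fetched again before running the assembler.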
abyss-pe \
name=ecoli \
j=4 \
k=25 \
c=360 \
e=2 \
s=200 \
v=-v \
in='fastq/ERR1007381_1.fastq.gz fastq/ERR1007381_2.fastq.gz' \
contigs \
2>&1 | tee abyss.log
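Once the run finishes, a quick way to judge the contigs is to compute their N50, the length of the contig at which the running total of lengths, sorted from longest to shortest, first reaches half of the assembly size. The sketch below does this with awk and sort. It assumes the contigs were written to a FASTA file named ecoli-contigs.fa (following the name=ecoli setting); contig_lengths and n50 are hypothetical helper names used only for this illustration.

```shell
# Print the length of every sequence in a FASTA file, one per line.
# Sequences that wrap over several lines are summed correctly.
contig_lengths() {
    awk '/^>/ { if (seen) print len; seen = 1; len = 0; next }
         { len += length($0) }
         END { if (seen) print len }' "$1"
}

# Print the N50: walk the lengths from longest to shortest and stop at
# the first contig where the running total reaches half of the total.
n50() {
    contig_lengths "$1" | sort -rn | awk '
        { len[NR] = $1; total += $1 }
        END {
            running = 0
            for (i = 1; i <= NR; i++) {
                running += len[i]
                if (running * 2 >= total) { print len[i]; exit }
            }
        }'
}

# Example call from the main exercise directory (file name is an assumption):
# n50 ecoli-contigs.fa
```

ABySS also ships its own abyss-fac utility for assembly statistics; the helper above is only meant to show what the N50 figure measures.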
The following will construct scaffolds from the contigs:
abyss-pe \
name=ecoli \
j=4 \
k=25 \
c=360 \